2013 Coal Mine Production

September 04, 2017 18:44 - E. Asa Bour - bourea@hotmail.com - Predict production



In [1]:

    
%matplotlib inline

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import explained_variance_score, r2_score, mean_squared_error

sns.set()









    



C:\Anaconda2\envs\py2\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

Cleaned this data in ../deliver/Data_Cleaning.ipynb



In [2]:

    
df = pd.read_csv("../data/cleaned_coalpublic2013.csv", index_col="MSHA ID")

df[['Year', 'Mine_Name']].head()









    Out[2]:







  
    
      
      Year
      Mine_Name
    
    
      MSHA ID
      
      
    
  
  
    
      103381
      2013
      Tacoa Highwall Miner
    
    
      103404
      2013
      Reid School Mine
    
    
      100759
      2013
      North River #1 Underground Min
    
    
      103246
      2013
      Bear Creek
    
    
      103451
      2013
      Knight Mine

Predict the production of coal mines



In [3]:

    
features = ['Average_Employees',
            'Labor_Hours']

categorials = ['Mine_State',
               'Mine_County',
               'Mine_Status',
               'Mine_Type',
               'Company_Type',
               'Operation_Type',
               'Operating_Company',
               'Operating_Company_Address',
               'Union_Code',
               'Coal_Supply_Region']

target = 'log_production'



In [4]:

    
sns.set_context('poster')
fig = plt.subplots(figsize=(14,8))
sns.violinplot(y='Mine_Status', x="log_production", data=df, split=True, inner="stick");
plt.tight_layout()
plt.savefig("../figures/Coal_Production_Prediction_Mine_Status_vs_log_Production.png")



In [5]:

    
sns.set_context('poster')
fig = plt.subplots(figsize=(14,8))
sns.violinplot(y='Company_Type', x="log_production", data=df, split=True, inner="stick");
plt.tight_layout()
plt.savefig("../figures/Coal_Production_Prediction_Company_Type_vs_log_Production.png")



In [6]:

    
dummy_categoricals = []
for categorical in categorials:
    print categorical, len(df[categorical].unique())
    # avoid the dummy variable trap!
    drop_var = sorted(df[categorical].unique())[-1]
    temp_df = pd.get_dummies(df[categorical], prefix=categorical)
    df = pd.concat([df, temp_df], axis=1)
    temp_df.drop('_'.join([categorical, str(drop_var)]), axis=1, inplace=True)
    dummy_categoricals += temp_df.columns.tolist()









    



Mine_State 29
Mine_County 164
Mine_Status 5
Mine_Type 3
Company_Type 3
Operation_Type 2
Operating_Company 752
Operating_Company_Address 692
Union_Code 7
Coal_Supply_Region 8

Build random forest regressor



In [7]:

    
train, test = train_test_split(df, test_size=0.3)



In [8]:

    
rf = RandomForestRegressor(n_estimators=100, oob_score=True)
rf.fit(train[features + dummy_categoricals], train[target])









    Out[8]:





RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=True, random_state=None,
           verbose=0, warm_start=False)



In [9]:

    
fig = plt.subplots(figsize=(8,8))
sns.regplot(test[target], rf.predict(test[features + dummy_categoricals]), color='green')
plt.ylabel('Predicted Production')
plt.xlim(0,22)
plt.ylim(0,22)
plt.tight_layout()
plt.savefig("../figures/Coal_Production_Prediction_RF_Prediction.png")



In [10]:

    
predicted = rf.predict(test[features + dummy_categoricals])
print "R^2 score", r2_score(test[target], predicted)









    



R^2 score 0.900442373451



In [11]:

    
print "Explained variance", explained_variance_score(test[target], predicted)









    



Explained variance 0.900442722355



In [12]:

    
print "MSE", mean_squared_error(test[target], predicted)









    



MSE 0.517706514947



In [13]:

    
rf_importances = pd.DataFrame({'name':train[features + dummy_categoricals].columns, 
                               'importance':rf.feature_importances_
                             }).sort_values(by='importance', ascending=False).reset_index(drop=True)
rf_importances.head(5)









    Out[13]:







  
    
      
      importance
      name
    
  
  
    
      0
      0.803333
      Labor_Hours
    
    
      1
      0.069395
      Average_Employees
    
    
      2
      0.009607
      Coal_Supply_Region_Powder River Basin
    
    
      3
      0.006403
      Mine_Type_Surface
    
    
      4
      0.003830
      Coal_Supply_Region_Illinois Basin

Conclusion

Fun example

	Year	Mine_Name
MSHA ID
103381	2013	Tacoa Highwall Miner
103404	2013	Reid School Mine
100759	2013	North River #1 Underground Min
103246	2013	Bear Creek
103451	2013	Knight Mine

	importance	name
0	0.803333	Labor_Hours
1	0.069395	Average_Employees
2	0.009607	Coal_Supply_Region_Powder River Basin
3	0.006403	Mine_Type_Surface
4	0.003830	Coal_Supply_Region_Illinois Basin